temporal distance
VLD: Visual Language Goal Distance for Reinforcement Learning Navigation
Milikic, Lazar, Patel, Manthan, Frey, Jonas
Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
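The ordinal-consistency evaluation mentioned above can be sketched as a pairwise agreement score: a distance function is useful for RL if it orders candidate states toward the goal the same way the true transit time does. This is a minimal illustration, not the paper's implementation; the tie handling and the pairwise loop are assumptions.

```python
from itertools import combinations

def ordinal_consistency(true_dists, pred_dists):
    """Fraction of state pairs whose ordering under the predicted
    distance agrees with the ordering under the ground-truth
    temporal distance (ties are excluded from the count)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(true_dists)), 2):
        dt = true_dists[i] - true_dists[j]
        dp = pred_dists[i] - pred_dists[j]
        if dt == 0 or dp == 0:
            continue  # skip tied pairs
        total += 1
        if (dt > 0) == (dp > 0):
            agree += 1
    return agree / total if total else 1.0
```

A perfectly monotone predictor scores 1.0 even if its absolute scale is wrong, which is exactly the property a distance-minimizing policy needs.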
- Europe > Switzerland > Zürich > Zürich (0.40)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- (3 more...)
Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations
Myers, Vivek, Zheng, Bill Chunyuan, Eysenbach, Benjamin, Levine, Sergey
Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) *contrastive representations*, in which methods learn "successor features" with a contrastive objective that performs inference over future outcomes, and (2) *temporal distances*, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a **quasimetric** distance parameterization to learn **optimal** goal-reaching distances, even with **suboptimal** data and in **stochastic** environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.
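The key structural object above, a quasimetric, is a distance that satisfies identity and the triangle inequality but need not be symmetric (reaching g from s can be cheaper than the reverse). A toy one-sided parameterization makes this concrete; it is only an illustration of the definition, not the learned quasimetric networks (e.g. IQE/MRN-style) such methods use.

```python
import numpy as np

def quasimetric(x, y):
    """Toy quasimetric: sum of one-sided coordinate gaps.
    d(x, x) = 0 and the triangle inequality hold, because
    relu(a + b) <= relu(a) + relu(b) coordinate-wise,
    but d(x, y) != d(y, x) in general (asymmetry)."""
    return float(np.sum(np.maximum(y - x, 0.0)))
```

The asymmetry is what lets such a parameterization encode irreversible dynamics, while the built-in triangle inequality provides the "free stitching" noted in the abstract.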
Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
Seo, Junyoung, Mira, Rodrigo, Haliassos, Alexandros, Bounareli, Stella, Chen, Honglie, Tran, Linh, Kim, Seungryong, Landgraf, Zoe, Shen, Jie
Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Audio-driven human animation aims to generate realistic human videos synchronized with input audio, with widespread applications in film production, virtual assistants, and digital content creation. The advent of Diffusion Transformers (DiTs) (Peebles & Xie, 2022) has significantly advanced this field, enabling natural human video generation not only for portrait videos but also in diverse environments with complex backgrounds (Xu et al., 2024; Chen et al., 2025a). 
However, current DiT-based models can only handle short clips at a time, typically around 5 seconds, due to the quadratic complexity of diffusion transformer architectures.
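The autoregressive loop with a future anchor described above can be sketched as follows. This is a schematic sketch under assumptions: `generate_window` stands in for the (hypothetical) windowed video generator, and passing `keyframe_fn=None` illustrates the self-keyframing variant in which the reference image itself is the lookahead target.

```python
def generate_with_lookahead(generate_window, reference, audio_chunks,
                            lookahead, keyframe_fn=None):
    """Autoregressive generation where each window is conditioned on an
    anchor `lookahead` steps ahead of the current window, rather than a
    keyframe inside it. With keyframe_fn=None, self-keyframing uses the
    reference image as the lookahead target, so no keyframe generation
    stage is needed."""
    frames, prev = [], reference
    for t, audio in enumerate(audio_chunks):
        # directional beacon: a future anchor, pursued but never reached
        anchor = reference if keyframe_fn is None else keyframe_fn(t + lookahead)
        window = generate_window(prev, audio, anchor)
        frames.extend(window)
        prev = window[-1]  # chain windows autoregressively
    return frames
```

The `lookahead` argument is the knob the abstract describes: larger values loosen the pull toward the anchor (more motion freedom), smaller values tighten identity adherence.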
Dual Goal Representations
Park, Seohong, Mann, Deepinder, Levine, Sergey
In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by "the set of temporal distances from all other states"; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.
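In a small deterministic MDP, the dual representation described above can be computed exactly: each goal is encoded by the vector of shortest transit times from every state. This is a toy sketch on an explicit graph (BFS over unit-cost transitions), whereas the paper's practical method learns the representation; the adjacency-dict encoding is an assumption for illustration.

```python
from collections import deque

def temporal_distances(adj, source):
    """Transit time (shortest path length) from `source` to every
    reachable state, via BFS over the directed transition graph."""
    dist = {source: 0}
    q = deque([source])
    while q:
        s = q.popleft()
        for t in adj[s]:
            if t not in dist:
                dist[t] = dist[s] + 1
                q.append(t)
    return dist

def dual_goal_representation(adj, goal):
    """Encode `goal` by its temporal distances from all other states;
    unreachable pairs get infinity."""
    states = sorted(adj)
    return [temporal_distances(adj, s).get(goal, float("inf"))
            for s in states]
```

Because the vector depends only on transit times, it is unchanged under any relabeling of the raw observations, which is the invariance property the abstract highlights.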
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
Liu, Yuyang, Wen, Chuan, Hu, Yihang, Jayaraman, Dinesh, Gao, Yang
Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 interactions per task with the environment. This approach outperformed previous methods and even the manually designed dense environment reward on both final success rate and sample efficiency. Moreover, we show that TimeRewarder can exploit real-world human videos, highlighting its potential as a scalable path to rich reward signals from diverse video sources. Mirroring how humans infer task progression by observing others, TimeRewarder distills frame-wise temporal distances from expert videos and converts them into dense reward signals, thereby enabling reinforcement learning free of manually engineered rewards or action annotations. Reinforcement learning (RL) has long served as a principal paradigm for robotic skill acquisition (Ibarz et al., 2021; Tang et al., 2025).
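The two ingredients above, frame-pair supervision from passive videos and step-wise proxy rewards, can be sketched as follows. This is a minimal illustration under assumptions: `predict_gap` stands in for a (hypothetical) trained temporal-distance model, and the sampling scheme for frame pairs is ours, not necessarily the paper's.

```python
import random

def frame_pair_dataset(video, max_gap=50, n_pairs=1000, rng=None):
    """Build (frame_i, frame_j, j - i) triples from a passive video:
    a model trained on these regresses the temporal distance j - i
    between two frames, with no action labels required."""
    rng = rng or random.Random(0)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(video) - 1)
        j = min(i + rng.randrange(1, max_gap + 1), len(video) - 1)
        pairs.append((video[i], video[j], j - i))
    return pairs

def proxy_rewards(frames, predict_gap):
    """Step-wise proxy reward: predicted temporal progress between
    consecutive observations, used in place of a hand-designed reward."""
    return [predict_gap(frames[t], frames[t + 1])
            for t in range(len(frames) - 1)]
```

Frames that advance the task yield large predicted gaps (high reward), while stalling or regressing yields small or negative ones, which is what makes the signal dense.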
STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning
Luan, Yao, Mu, Ni, Yang, Yiqin, Xu, Bo, Jia, Qing-Shan
Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning rewards directly from human preferences, enabling better alignment with human intentions. However, its effectiveness in multi-stage tasks, where agents sequentially perform sub-tasks (e.g., navigation, grasping), is limited by stage misalignment: Comparing segments from mismatched stages, such as movement versus manipulation, results in uninformative feedback, thus hindering policy learning. In this paper, we validate the stage misalignment issue through theoretical analysis and empirical experiments. To address this issue, we propose STage-AlIgned Reward learning (STAIR), which first learns a stage approximation based on temporal distance, then prioritizes comparisons within the same stage. Temporal distance is learned via contrastive learning, which groups temporally close states into coherent stages, without predefined task knowledge, and adapts dynamically to policy changes. Extensive experiments demonstrate STAIR's superiority in multi-stage tasks and competitive performance in single-stage tasks. Furthermore, human studies show that stages approximated by STAIR are consistent with human cognition, confirming its effectiveness in mitigating stage misalignment.
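The two steps described above, approximating stages from a temporal-distance embedding and restricting preference queries to within-stage pairs, can be sketched as follows. This is an illustrative sketch, not the paper's algorithm: the 1-D embedding, fixed stage centers, and nearest-center assignment are simplifying assumptions (the method learns the embedding contrastively and adapts it to the policy).

```python
def assign_stages(embed, states, centers):
    """Assign each state to the nearest stage center in the learned
    temporal-distance embedding space (1-D here for simplicity)."""
    return [min(range(len(centers)), key=lambda k: abs(embed(s) - centers[k]))
            for s in states]

def same_stage_pairs(segments, stage_of):
    """Keep only segment pairs from the same stage for preference
    queries, avoiding uninformative cross-stage comparisons
    (e.g. movement vs. manipulation)."""
    return [(a, b) for i, a in enumerate(segments)
            for b in segments[i + 1:] if stage_of(a) == stage_of(b)]
```

Filtering the query set this way is the mechanism by which stage misalignment is mitigated: every comparison the human labels is between behaviors attempting the same sub-task.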
A Algorithms

Algorithm 1 Training DHRL
1: sample D

Over T time-steps, this upper bound on the error rate is also satisfied along every path from s to g. As shown in the table above, the wider the initial distribution, the easier it is for the agent to explore the map; the 'fixed initial state distribution' requires less prior information about the state space.

Figure 12: Changes in the graph level over the training; DHRL can explore long tasks with a 'fixed initial state distribution'. The results are averaged over 4 random seeds and smoothed equally.